Directions

Students are encouraged to work together on homework. However, sharing, copying or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible.

Final submissions must be uploaded to our Compass 2g site on the Homework page. No email, hardcopy, or late submissions will be accepted.

Assignment

Exercise 1 (EPA Emissions Data)

For this exercise we will use the data stored in epa2015.csv. It contains detailed descriptions of 4,411 vehicles manufactured in 2015 that were used for fuel economy testing as performed by the Environment Protection Agency. The variables in the dataset are:

  • Make - manufacturer
  • Model - model of vehicle
  • ID - manufacturer defined vehicle identification number within EPA’s computer system (not a VIN number)
  • disp - cubic inch displacement of test vehicle
  • type - car, truck, or both (for vehicles that meet specifications of both car and truck, like smaller SUVs or crossovers)
  • horse - rated horsepower, in foot-pounds per second
  • cyl - number of cylinders
  • lockup - vehicle has transmission lockup; N or Y
  • drive - drivetrain system code
    • A = All-wheel drive
    • F = Front-wheel drive
    • P = Part-time 4-wheel drive
    • R = Rear-wheel drive
    • 4 = 4-wheel drive
  • weight - test weight, in pounds
  • axleratio - axle ratio
  • nvratio - n/v ratio (engine speed versus vehicle speed at 50 mph)
  • THC - total hydrocarbons, in grams per mile (g/mi)
  • CO - Carbon monoxide (a regulated pollutant), in g/mi
  • CO2 - Carbon dioxide (the primary byproduct of all fossil fuel combustion), in g/mi
  • mpg - fuel economy, in miles per gallon 徐昊飞 We will attempt to model CO2 using both horse and type. In practice we would use many more predictors, but limiting ourselves to these two, one numeric and one factor, will allow us to create a number of plots.

(a) Load the data, and check its structure using str(). Verify that type is a factor; if not, coerce it to be a factor.

(b) Make a scatterplot of CO2 versus horse. Use a different color point for each vehicle type. Which color is which type?

(c) Fit a SLR model with CO2 as the response and only horse as the predictor. Recreate your plot and add the fitted regression line. Comment on how well this line models the data. Give an estimate for the average change in CO2 for a one foot-pound per second increase in horse for a vehicle of type truck. Give a 95% prediction interval using this model for the CO2 of a Subaru Impreza Wagon, which is a vehicle with 148 horsepower and is considered type Both. (Interestingly, the dataset gives the wrong drivetrain for most Subarus in this dataset, as they are almost all listed as F, when they are in fact all-wheel drive.)

(d) Fit an additive multiple regression model with CO2 as the response and horse and type as the predictors. Recreate your plot and add the fitted regression “lines” with the same colors as their respective points. Comment on how well these lines model the data. Give an estimate for the average change in CO2 for a one foot-pound per second increase in horse for a vehicle of type truck. Give a 95% prediction interval using this model for the CO2 of a Subaru Impreza Wagon, which is a vehicle with 148 horsepower and is considered type Both.

(e) Fit an interaction multiple regression model with CO2 as the response and horse and type as the predictors. Recreate your plot and add the fitted regression “lines” with the same colors as their respective points. Comment on how well these lines model the data. Give an estimate for the average change in CO2 for a one foot-pound per second increase in horse for a vehicle of type truck. Give a 95% prediction interval using this model for the CO2 of a Subaru Impreza Wagon, which is a vehicle with 148 horsepower and is considered type Both.

(f) You will perform \(F\)-tests later in the exercise, but for now, based solely on the three previous plots, which model is preferred: SLR, additive, or interaction?

(g) Use an ANOVA \(F\)-test to compare the SLR and additive models. Based on this test and a significance level of \(\alpha = 0.01\), which model is preferred?

(h) Use an ANOVA \(F\)-test to compare the additive and interaction models. Based on this test and a significance level of \(\alpha = 0.01\), which model is preferred?

Exercise 2 (Hospital SUPPORT Data)

For this exercise we will use the data stored in hospital.csv. It contains a random sample of 580 seriously ill hospitalized patients from a famous study called “SUPPORT” (Study to Understand Prognoses Preferences Outcomes and Risks of Treatment). As the name suggests, the purpose of the study was to determine what factors affected or predicted outcomes, such as how long a patient remained in the hospital. The variables in the dataset are:

  • Days - Days to death or hospital discharge
  • Age - Age on day of hospital admission
  • Sex - female or male
  • Comorbidity - Patient diagnosed with more than one chronic disease
  • EdYears - Years of education
  • Education - Education level; high or low
  • Income - Income level; high or low
  • Charges - Hospital charges, in dollars
  • Care - Level of care required; high or low
  • Race - Non-white or white
  • Pressure - Blood pressure, in mmHg
  • Blood - White blood cell count, in gm/dL
  • Rate - Heart rate, in bpm

For this exercise, we will use Charges, Pressure, Care, and Race to model Days.

(a) Load the data, and check its structure using str(). Verify that Care and Race are factors; if not, coerce them to be factors. What are the levels of Care and Race?

(b) Fit an additive multiple regression model with Days as the response using Charges, Pressure, Care, and Race as predictors. What does R choose as the reference level for Care and Race?

(c) Fit a multiple regression model with Days as the response. Use the main effects of Charges, Pressure, Care, and Race, as well as the interaction of Care with each of the numeric predictors as predictors. (that is, the interaction of Care with Charges and the interaction of Care with Pressure). Use a statistical test to compare this model to the additive model using a significance level of \(\alpha = 0.01\). Which do you prefer?

(d) Fit a multiple regression model with Days as the response. Use the predictors from the model in (c) as well as the interaction of Race with each of the numeric predictors. (that is, the interaction of Race with Charges and the interaction of Race with Pressure). Use a statistical test to compare this model to the additive model using a significance level of \(\alpha = 0.01\). Which do you prefer?

(e) Using the model in (d), give an estimate of the change in average Days for a one-unit increase in Pressure for a "white" patient that required a high level of care.

(f) Find a model using the four predictors that we have been considering that is more flexible than the model in (d) and that is also statistically significant as compared to the model in (d) at a significance level of \(\alpha = 0.01\).

Exercise 3 (Fish Data)

For this exercise we will use the data stored in fish.csv. It contains data for 158 fish of 7 different species all gathered from the same lake in one season. The variables in the dataset are:

  • Species - Common name (Latin name)
    • 1 = Bream (Abramis brama)
    • 2 = Whitewish (Leuciscus idus)
    • 3 = Roach (Leuciscus rutilus)
    • 4 = (Abramis bjoerkna)
    • 5 = Smelt (Osmerus eperlanus)
    • 6 = Pike (Esox Lucius)
    • 7 = Perch (Perca fluviatilis)
  • Weight - Weight of the fish, in grams
  • Length1 - Length from the nose to the beginning of the tail, in cm
  • Length2 - Length from the nose to the notch of the tail, in cm
  • Length3 - Length from the nose to the end of the tail, in cm
  • HeightPct - Maximal height as % of Length3
  • WidthPct - Maximal width as % of Length3
  • Sex - 0 = female, 1 = male

We will attempt to predict Weight using Length1, HeightPct, and WidthPct.

(a) Use R to fit the model

\[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \beta_6 x_2 x_3 + \beta_7 x_1 x_2 x_3 + \epsilon, \]

where

  • \(Y\) is Weight
  • \(x_1\) is Length1
  • \(x_2\) is HeightPct
  • \(x_3\) is WidthPct.

Report the estimated coefficients of the model.

(b) Consider fitting a smaller model in R.

fish_smaller = lm(Weight ~ Length1 + HeightPct * WidthPct, data = fish)

Use a statistical test to compare this model with the previous. Report the following:

  • The null and alternative hypotheses in terms of the model given in (a)
  • The value of the test statistic
  • The p-value of the test
  • A statistical decision using a significance level of \(\alpha = 0.05\)
  • Which model you prefer

(c) Give an expression based on the model in (a) for the true change in average weight for a 1 cm increase in Length1 for a fish with a HeightPct of 20 and a WidthPct of 10.

(d) Give an expression based on the smaller model in (b) for the true change in average weight for a 1 cm increase in Length1 for a fish with a HeightPct of 20 and a WidthPct of 10.

Exercise 4 (\(t\)-test Is a Linear Model)

In this exercise, we will try to convince ourselves that a two-sample \(t\)-test assuming equal variance is the same as a \(t\)-test for the coefficient in front of a single factor variable in a linear model.

First we setup the data frame that we will use throughout.

n = 16

ex4 = data.frame(
  groups = c(rep("A", n / 2), rep("B", n / 2)),
  values = rep(0, n))
str(ex4)
## 'data.frame':    16 obs. of  2 variables:
##  $ groups: chr  "A" "A" "A" "A" ...
##  $ values: num  0 0 0 0 0 0 0 0 0 0 ...

We will use a total sample size of 16, 8 for each group. The groups variable splits the data into two groups, A and B, which will be the grouping variable for the \(t\)-test and a factor variable in a regression. The values variable will store simulated data.

We will repeat the following process a number of times.

ex4$values = rnorm(n, mean = 10, sd = 3) # simualte data
summary(lm(values ~ groups, data = ex4))
## 
## Call:
## lm(formula = values ~ groups, data = ex4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3590 -1.0837  0.3642  1.0391  5.5094 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.263      1.005   9.217 2.54e-07 ***
## groupsB        2.045      1.421   1.439    0.172    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.843 on 14 degrees of freedom
## Multiple R-squared:  0.1288, Adjusted R-squared:  0.06655 
## F-statistic: 2.069 on 1 and 14 DF,  p-value: 0.1723
t.test(values ~ groups, data = ex4, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  values by groups
## t = -1.4386, df = 14, p-value = 0.1723
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.093195  1.003801
## sample estimates:
## mean in group A mean in group B 
##        9.263231       11.307928

We use lm() to test

\[ H_0: \beta_1 = 0 \]

for the model

\[ Y = \beta_0 + \beta_1 x_1 + \epsilon \]

where \(Y\) are the values of interest, and \(x_1\) is a dummy variable that splits the data in two. We will let R take care of the dummy variable.

We use t.test() to test

\[ H_0: \mu_A = \mu_B \]

where \(\mu_A\) is the mean for the A group, and \(\mu_B\) is the mean for the B group.

The following code sets up some variables for storage.

num_sims = 100
lm_t = rep(0, num_sims)
lm_p = rep(0, num_sims)
tt_t = rep(0, num_sims)
tt_p = rep(0, num_sims)
  • lm_t will store the test statistic for the test \(H_0: \beta_1 = 0\).
  • lm_p will store the p-value for the test \(H_0: \beta_1 = 0\).
  • tt_t will store the test statistic for the test \(H_0: \mu_A = \mu_B\).
  • tt_p will store the p-value for the test \(H_0: \mu_A = \mu_B\).

The variable num_sims controls how many times we will repeat this process, which we have chosen to be 100.

(a) Set a seed equal to your UIN. Then write code that repeats the above process 100 times. Each time, store the appropriate values in lm_t, lm_p, tt_t, and tt_p. Specifically, each time you should use ex4$values = rnorm(n, mean = 10, sd = 3) to update the data. The grouping will always stay the same.

(b) Report the value obtained by running mean(lm_t == tt_t), which tells us what proportion of the test statistics are equal. The result may be extremely surprising!

(c) Report the value obtained by running mean(lm_p == tt_p), which tells us what proportion of the p-values are equal. The result may be extremely surprising!

(d) If you have done everything correctly so far, your answers to the last two parts won’t indicate the equivalence we want to show! What the heck is going on here? The first issue is one of using a computer to do calculations. When a computer checks for equality, it demands equality; nothing can be different. However, when a computer performs calculations, it can only do so with a certain level of precision. So if we calculate two quantities we know to be analytically equal, they can differ numerically. Instead of mean(lm_p == tt_p) run all.equal(lm_p, tt_p). This will perform a similar calculation, but with a very small error tolerance for each equality. What is the result of running this code? What does it mean?

(e) Your answer in (d) should now make much more sense. Then what is going on with the test statistics? Take a look at the values stored in lm_t and tt_t. What do you notice? Is there a relationship between the two? Can you explain why this is happening?